Abstract

Research in reinforcement learning has produced algorithms for opti- 
mal decision making under uncertainty that fall within two main types. 
The first employs a Bayesian framework, where optimality improves with 
increased computational time. This is because the resulting planning task 
takes the form of a dynamic programming problem on a belief tree with 
an infinite number of states. The second type employs relatively simple 
algorithms which are shown to suffer small regret within a distribution-free
framework. This paper presents a lower bound and a high probability up- 
per bound on the optimal value function for the nodes in the Bayesian 
belief tree, which are analogous to similar bounds in POMDPs. The 
bounds are then used to create more efficient strategies for exploring the 
tree. The resulting algorithms are compared with the distribution-free 
algorithm UCB1, as well as a simpler baseline algorithm on multi-armed 
bandit problems. 



1 Introduction 

In recent work [6,7,10,15-17,21], Bayesian methods for exploration in Markov 
decision processes (MDPs) and for solving known partially-observable Markov 
decision processes (POMDPs), as well as for exploration in the latter case, 
have been proposed. All such methods suffer from computational intractability 
problems for most domains of interest. 

The sources of intractability are two-fold. Firstly, there may be no com- 
pact representation of the current belief. This is especially true for POMDPs. 
Secondly, optimally behaving under uncertainty requires that we create an aug- 
mented MDP model in the form of a tree [7] , where the root node is the current 
belief-state pair and children are all possible subsequent belief-state pairs. This 
tree grows large very fast, and it is particularly problematic to grow in the 
case of continuous observations or actions. In this work, we concentrate on the 
second problem - and consider algorithms for expanding the tree. 



* This is a corrected and slightly expanded version of the homonymous paper presented at 
CIMCA'08. 



Since the Bayesian exploration methods require a tree expansion to be per- 
formed, we can view the whole problem as that of nested exploration. For the 
simplest exploration-exploitation trade-off setting, bandit problems, there al- 
ready exist nearly optimal, computationally simple methods [1]. Such methods 
have recently been extended to tree search [12]. This work proposes to take ad- 
vantage of the special structure of belief trees in order to design nearly-optimal 
algorithms for expansion of nodes. In a sense, by recognising that the tree ex- 
pansion problem in Bayesian look-ahead exploration methods is also an optimal 
exploration problem, we develop tree algorithms that can solve this problem effi- 
ciently. Furthermore, we are able to derive interesting upper and lower bounds 
for the value of branches and leaf nodes which can help limit the amount of 
search. The ideas developed are tested in the multi-armed bandit setting for 
which nearly-optimal algorithms already exist. 

The remainder of this section introduces the augmented MDP formalism 
employed within this work and discusses related work. Section 2 discusses tree 
expansion in exploration problems and introduces some useful bounds. These 
bounds are used in the algorithms detailed in Section 3, which are then evaluated 
in Section 4. We conclude with an outlook to further developments. 

1.1 Preliminaries 

We are interested in sequential decision problems where, at each time step t, 
the agent seeks to maximise the expected utility 

$$\mathbb{E}[u_t] = \mathbb{E}\left[\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}\right],$$

where $\gamma \in [0,1)$ is a discount factor, $r$ is a stochastic reward and $u_t$ is simply the discounted sum of future
rewards. We shall assume that the sequence of rewards arises from a Markov 
decision process, defined below. 

Definition 1.1 (Markov decision process) A Markov decision process (MDP)
is defined as the tuple $\mu = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$ comprised of a set of states $\mathcal{S}$, a set of
actions $\mathcal{A}$, a transition distribution $\mathcal{T}$ conditioning the next state on the current
state and action,

$$\mathcal{T}(s' \mid s, a) = \mu(s_{t+1}=s' \mid s_t=s, a_t=a) \tag{1}$$

satisfying the Markov property $\mu(s_{t+1} \mid s_t, a_t) = \mu(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots)$,
and a reward distribution $\mathcal{R}$ conditioned on states and actions:

$$\mathcal{R}(r \mid s, a) = \mu(r_{t+1}=r \mid s_t=s, a_t=a), \tag{2}$$

with $a \in \mathcal{A}$, $s, s' \in \mathcal{S}$, $r \in \mathbb{R}$. Finally,

$$\mu(r_{t+1}, s_{t+1} \mid s_t, a_t) = \mu(r_{t+1} \mid s_t, a_t)\, \mu(s_{t+1} \mid s_t, a_t). \tag{3}$$



We shall denote the set of all MDPs as $\mathcal{M}$. For any policy $\pi$ that is an arbitrary
distribution on actions, we can define a $T$-horizon value function for an MDP
$\mu \in \mathcal{M}$ at time $t$ as:

$$V^{\pi,\mu}_{t,T}(s,a) = \mathbb{E}[r_{t+1} \mid s_t=s, a_t=a, \mu] + \gamma \sum_{s'} \mu(s_{t+1}=s' \mid s_t=s, a_t=a)\, V^{\pi,\mu}_{t+1,T}(s').$$

Note that for the infinite-horizon case, $\lim_{T\to\infty} V^{\pi,\mu}_{t,T} = V^{\pi,\mu}$ for all $t$.
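The recursion above can be evaluated by backward induction when the MDP is known. The following is a minimal sketch under stated assumptions: the MDP is tabular, the policy is stationary, and the names `finite_horizon_q`, `P`, `R` and `policy` are illustrative and do not come from the paper. `R[s][a]` stands in for the expected reward $\mathbb{E}[r_{t+1} \mid s_t=s, a_t=a, \mu]$.

```python
def finite_horizon_q(P, R, policy, gamma, horizon):
    """Action values for a known tabular MDP over a finite horizon.

    P[s][a][s2] : transition probability mu(s2 | s, a)
    R[s][a]     : expected immediate reward
    policy[s][a]: probability that the policy picks action a in state s
    """
    n_states, n_actions = len(P), len(P[0])
    v_next = [0.0] * n_states  # value beyond the horizon is taken to be zero
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(horizon):
        # One backward step: immediate reward plus discounted next-state value.
        q = [[R[s][a] + gamma * sum(P[s][a][s2] * v_next[s2]
                                    for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        # State value under the policy, used by the next (earlier) step.
        v_next = [sum(policy[s][a] * q[s][a] for a in range(n_actions))
                  for s in range(n_states)]
    return q
```

With a single state, a single action, unit reward and $\gamma = 0.5$, three backward steps give $1 + 0.5\,(1 + 0.5) = 1.75$.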

In the case where the MDP is unknown, it is possible to use a Bayesian
framework to represent our uncertainty (cf. [7]). This essentially works by
maintaining a belief $\xi_t \in \Xi$ about which MDP $\mu \in \mathcal{M}$ corresponds to reality.
In a Bayesian setting, $\xi_t(\mu)$ is our subjective probability measure that $\mu$ is true.

In order to optimally select actions in this framework, we need to use the
approach suggested originally in [3] under the name of Adaptive Control Processes.
The approach was investigated more fully in [6,7]. This creates an augmented
MDP, with a state comprised of the original MDP's state $s_t$ and our belief state
$\xi_t$. We can then, in principle, solve the exploration problem via standard dynamic
programming algorithms such as backwards induction. We shall call such models
Belief-Augmented MDPs, analogously to the Bayes-Adaptive MDPs of [7]. This
is done by not only considering densities conditioned on the state-action pairs
$(s_t, a_t)$, i.e. $p(r_{t+1}, s_{t+1} \mid s_t, a_t)$, but also taking into account the belief $\xi_t \in \Xi$, a
probability space over possible MDPs, i.e. augmenting the state space from $\mathcal{S}$ to $\mathcal{S} \times \Xi$
and considering the following conditional density: $p(r_{t+1}, s_{t+1}, \xi_{t+1} \mid s_t, a_t, \xi_t)$.
More formally, we may give the following definition:

Definition 1.2 (Belief-Augmented MDP) A Belief-Augmented MDP $\nu$ (BAMDP)
is an MDP $\nu = (\Omega, \mathcal{A}, \mathcal{T}', \mathcal{R}')$ where $\Omega = \mathcal{S} \times \Xi$, where $\Xi$ is the set of probability
measures on $\mathcal{M}$, and $\mathcal{T}', \mathcal{R}'$ are the transition and reward distributions conditioned
jointly on the MDP state $s_t$, the belief state $\xi_t$, and the action $a_t$. Here
$\xi_t(\xi_{t+1} \mid r_{t+1}, s_{t+1}, s_t, a_t)$ is singular, so that we can define the transition

$$p(\omega_{t+1} \mid a_t, \omega_t) = p(s_{t+1}, \xi_{t+1} \mid a_t, s_t, \xi_t).$$

It should be obvious that $s_t, \xi_t$ jointly form a Markov state in this setting, called
the hyper-state. In general, we shall denote the components of a future
hyper-state $\omega^i_t$ as $(s^i_t, \xi^i_t)$. However, on occasion we will abuse notation by referring to
the components of some hyper-state $\omega$ as $s_\omega, \xi_\omega$. We shall use $\mathcal{M}_B$ to denote
the set of BAMDPs.

As in the MDP case, finite horizon problems only require sampling all future
actions until the horizon $T$.

However, because the set of hyper-states available at each time-step is necessarily
different from those at other time-steps, the value function cannot be easily
calculated for the infinite horizon case.

In fact, the only clear solution is to continue expanding a belief tree until we
are certain of the optimality of an action. As has previously been observed [4,5],
this is possible since we can always obtain upper and lower bounds on the utility
of any policy from the current hyper-state. We can apply such bounds on future
hyper-states in order to efficiently expand the tree.




1.2 Related work 



To date, most work has only used full expansion of the belief tree up to a
certain depth. A notable exception is [22], which uses Thompson sampling [20]
to expand the tree. In very recent work [18], the importance of tree expansion in
the closely related POMDP setting¹ has been recognised. Therein, the authors
contrast and compare many different methods for tree expansion, including 
branch-and-bound [13] methods and Monte Carlo sampling. 

Monte Carlo sampling methods have also been recently explored in the upper 
confidence bounds on trees (UCT) algorithms, proposed in [8, 12] in the context 
of planning in games. Our case is similar; however, we can take advantage of the
special structure of the belief tree. In particular, for each node we can obtain 
a lower bound and a high-probability upper bound on the value of the optimal 
policy. 

This paper's contribution is to recognise that tree expansion in Bayesian 
exploration is itself an exploration problem with very special properties. Based 
on this insight, it proposes to combine sampling with lower bounds and upper 
bound estimates at the leaves. This allows us to obtain high-probability bounds 
for expansion of the tree. While the proposed methods are similar to the ones 
used in the discrete-state POMDP setting [18], the BAMDP requires the
evaluation of different bounds at leaf nodes. On the experimental side, we present
first results on bandit problems, for which nearly-optimal distribution-free al- 
gorithms are known. We believe that this is a very important step towards 
extending the applicability of Bayesian look-ahead methods in exploration. 

2 Belief tree expansion 

Let the current belief be $\xi_t$ and suppose we observe $x^i_t = (s^i_{t+1}, r^i_{t+1}, a^i_t)$. This
observation defines a unique subsequent belief $\xi^i_{t+1}$. Together with the MDP
state $s^i_{t+1}$, this creates a hyper-state transition from $\omega_t$ to $\omega^i_{t+1}$. By recursively
obtaining observations for future beliefs, we can obtain an unbalanced tree with
nodes $\{\omega^i_{t+k} : k = 1, \ldots, T;\ i = 1, \ldots\}$. However, we cannot hope to be able to
fully expand the tree. This is especially true in the case where observations (i.e. 
states, rewards, or actions) are continuous, where we cannot perform even a full 
single-step expansion. Even in the discrete case the problem is intractable for 
infinite horizons - and far too complex computationally for the finite horizon 
case. However, had there been efficient tree expansion methods, this problem 
would be largely alleviated. The remainder of this section details bounds and 
algorithms that can be used to reduce the computational complexity of the 
Bayesian lookahead approach. 

2.1 Expanding a given node 

All tree search methods require the expansion of leaf nodes. However, in gen- 
eral, a leaf node may have an infinite number of children. We thus need some 
strategies to limit the number of children. 

¹ The BAMDP setting is equivalent to a POMDP where the unobservable part of the state
is stationary, but continuous (chap. 5 of [7]).



More formally, let us assume that we wish to expand in node $\omega^i_t = (\xi^i_t, s^i_t)$,
with $\xi^i_t$ defining a density over $\mathcal{M}$. For discrete state/action/reward spaces,
we can simply enumerate all the possible outcomes $\{\omega^j_{t+1}\}_{j=1}^{|\mathcal{S} \times \mathcal{A} \times R|}$, where $R$ is
the set of possible reward outcomes. Note that if the reward is deterministic, 
there is only one possible outcome per state-action pair. The same holds if T is 
deterministic, in both cases making an enumeration possible. While in general 
this may not be the case, since rewards, states, or actions can be continuous, in 
this paper we shall only examine the discrete case. 

2.2 Bounds on the optimal value function 

At each point in the process, the next node $\omega^i_{t+k}$ to be expanded is the one
maximising a utility $U(\omega^i_{t+k})$. Let $\Omega_T$ be the set of leaf nodes. If their values
were known, then we could easily perform the backwards induction procedure
shown in Algorithm 1. The main problem is obtaining a good estimate for $V^*_T$,
i.e. the value of leaf nodes.

Algorithm 1 Backwards induction action selection

procedure BackwardsInduction($t$, $\nu$, $\Omega_T$, $V^*_T$)
    for $n = T-1, T-2, \ldots, t$ do
        for $\omega \in \Omega_n$ do
            $a^*_n(\omega) = \arg\max_a \sum_{\omega' \in \Omega_{n+1}} \nu(\omega' \mid \omega, a)\, [\mathbb{E}(r \mid \omega', \omega, \nu) + \gamma V^*_{n+1}(\omega')]$
            $V^*_n(\omega) = \sum_{\omega' \in \Omega_{n+1}} \nu(\omega' \mid \omega, a^*_n)\, [\mathbb{E}(r \mid \omega', \omega, \nu) + \gamma V^*_{n+1}(\omega')]$
        end for
    end for
    return $a^*_t$
end procedure

Let $\pi^*(\mu)$ denote the policy such that, for any $\pi$,

$$V^{\pi^*(\mu)}_{\mu}(s) \ge V^{\pi}_{\mu}(s) \quad \forall s \in \mathcal{S}.$$

Furthermore, let the maximum probability MDP arising from the belief at
hyper-state $\omega$ be $\hat\mu_\omega = \arg\max_\mu \xi_\omega(\mu)$. Similarly, we denote the mean MDP with

$$\bar\mu_\omega = \mathbb{E}[\mu \mid \xi_\omega].$$

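Proposition 2.1 The optimal value function at any leaf node $\omega$ is bounded by
the following inequalities

$$\int V^{\pi^*(\mu)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu \;\ge\; V^*(\omega) \;\ge\; \int V^{\pi^*(\bar\mu_\omega)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu. \tag{5}$$

Proof By definition, $V^*(\omega) \ge V^\pi(\omega)$ for all $\omega$, for any policy $\pi$. The lower
bound follows trivially, since

$$V^{\pi^*(\bar\mu_\omega)}(\omega) = \int V^{\pi^*(\bar\mu_\omega)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu. \tag{6}$$

The upper bound is derived as follows. First note that for any function $f$,
$\sup_x \int f(x, u)\, du \le \int \sup_x f(x, u)\, du$. Then, we remark that:

$$V^*(\omega) = \sup_\pi \int V^{\pi}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu \tag{7a}$$

$$\le \int \sup_\pi V^{\pi}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu \tag{7b}$$

$$= \int V^{\pi^*(\mu)}_{\mu}(s_\omega)\, \xi_\omega(\mu)\, d\mu. \tag{7c}$$

$\blacksquare$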

In POMDPs, a trivial lower bound can be obtained by calculating the value 
of the blind policy [9,19], which always takes the same action. Our lower bound 
is in fact the BAMDP analogue of the value of the blind policy in POMDPs. 
This is because for any fixed policy $\pi$, it holds trivially that $V^\pi(\omega) \le V^*(\omega)$.
In our case, we have made this lower bound tighter by considering $\pi^*(\bar\mu_\omega)$, the
policy that is greedy with respect to the current mean estimate.

The upper bound itself is analogous to the POMDP value function bound 
given in Theorem 9 of [9] . However, while the lower bound is easy to compute in 
our case, the upper bound can only be approximated via Monte Carlo sampling 
with some probability. 

2.3 Calculating the lower bound 

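The lower bound can be calculated by performing value iteration in the mean
MDP. This is because, for any policy $\pi$ and belief $\xi$, $\int V^\pi_\mu(s)\, \xi(\mu)\, d\mu$ can be
written as

$$\begin{aligned}
\int \Big\{ \mathbb{E}[r \mid s, \mu, \pi] &+ \gamma \sum_{s'} \mu(s' \mid s, \pi(s))\, V^\pi_\mu(s') \Big\}\, \xi(\mu)\, d\mu \\
&= \sum_a \pi(a \mid s) \Big\{ \int \mathbb{E}[r \mid s, a, \mu]\, \xi(\mu)\, d\mu + \gamma \sum_{s'} \int \mu(s' \mid s, a)\, V^\pi_\mu(s')\, \xi(\mu)\, d\mu \Big\} \\
&= \sum_a \pi(a \mid s) \Big\{ \mathbb{E}[r \mid s, a, \bar\mu_\xi] + \gamma \sum_{s'} \bar\mu_\xi(s' \mid s, a)\, V^\pi_{\bar\mu_\xi}(s') \Big\},
\end{aligned}$$

where $\bar\mu_\xi$ is the mean MDP for belief $\xi$. If the beliefs $\xi$ can be expressed
in closed form, it is easy to calculate the mean transition distribution and
the mean reward from $\xi$. For discrete state spaces, transitions can be expressed
as multinomial distributions, to which the Dirichlet density is a conjugate
prior. In that case, for Dirichlet parameters $\{\psi^{j,a}_i(\xi) : i, j \in \mathcal{S}, a \in \mathcal{A}\}$,
we have $\bar\mu_\xi(s' \mid s, a) = \psi^{s,a}_{s'}(\xi) / \sum_{i \in \mathcal{S}} \psi^{s,a}_i(\xi)$. Similarly, for Bernoulli rewards,
the corresponding mean model arising from the beta prior with parameters
$\{\alpha^{s,a}(\xi), \beta^{s,a}(\xi) : s \in \mathcal{S}, a \in \mathcal{A}\}$ is $\mathbb{E}[r \mid s, a, \bar\mu_\xi] = \alpha^{s,a}(\xi) / (\alpha^{s,a}(\xi) + \beta^{s,a}(\xi))$.
Then the value function of the mean model, and consequently, a lower bound
on the optimal value function, can be found with standard value iteration.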
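As an illustration of this computation, the sketch below forms the mean MDP from Dirichlet and beta parameters and then runs value iteration on it. It assumes tabular nested lists; the function names `mean_mdp` and `value_iteration`, the stopping tolerance and the data layout are our own choices for the example and are not taken from the paper.

```python
def mean_mdp(psi, alpha, beta):
    """Mean transition and reward model of a Dirichlet/beta belief.

    psi[s][a][s2]           : Dirichlet parameters for transitions from (s, a)
    alpha[s][a], beta[s][a] : beta parameters for the Bernoulli reward at (s, a)
    """
    P = [[[c / sum(row) for c in row] for row in psi_s] for psi_s in psi]
    R = [[alpha[s][a] / (alpha[s][a] + beta[s][a])
          for a in range(len(alpha[s]))]
         for s in range(len(alpha))]
    return P, R


def value_iteration(P, R, gamma, tol=1e-10):
    """Optimal state values of a known tabular MDP (here: the mean MDP)."""
    n_states = len(P)
    v = [0.0] * n_states
    while True:
        v_new = [max(R[s][a] + gamma * sum(p * v[s2]
                                           for s2, p in enumerate(P[s][a]))
                     for a in range(len(P[s])))
                 for s in range(n_states)]
        # Stop once successive iterates differ by less than the tolerance.
        if max(abs(x - y) for x, y in zip(v, v_new)) < tol:
            return v_new
        v = v_new
```

For a single-state problem with mean rewards 0.25 and 0.75 and $\gamma = 0.5$, the iteration converges to $0.75 / (1 - 0.5) = 1.5$.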

2.4 Upper bound with high probability 



In general, (7b) cannot be expressed in closed form. However, the integral can 
be approximated via Monte Carlo sampling. 



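Let the leaf node which we wish to expand be $\omega$. Then, we can obtain $c$ MDP
samples from the belief at $\omega$: $\mu_1, \ldots, \mu_c \sim \xi_\omega(\mu)$. For each $\mu_k$ we can derive the
optimal policy $\pi^*(\mu_k)$ and estimate its value function $\tilde v^*_k \triangleq V^{\pi^*(\mu_k)}_{\mu_k} \equiv V^*_{\mu_k}$. We
may then average these samples to obtain

$$\hat v^*_c(\omega) \triangleq \frac{1}{c} \sum_{k=1}^{c} \tilde v^*_k(s_\omega). \tag{8}$$

Let $\bar v^*(\omega) = \int_{\mathcal{M}} \xi_\omega(\mu)\, V^*_\mu(s_\omega)\, d\mu$. It holds that $\lim_{c\to\infty} \hat v^*_c = \bar v^*(\omega)$ and that
$\mathbb{E}[\hat v^*_c] = \bar v^*(\omega)$. Due to the latter, we can apply a Hoeffding inequality

$$P\big(|\hat v^*_c(\omega) - \bar v^*(\omega)| > \epsilon\big) < 2 \exp\left(-\frac{2 c \epsilon^2}{(V_{\max} - V_{\min})^2}\right), \tag{9}$$

thus bounding the error within which we estimate the upper bound. For $r_t \in
[0, 1]$ and discount factor $\gamma$, note that $V_{\max} - V_{\min} \le 1/(1-\gamma)$.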
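For a Bernoulli bandit the sampled value functions are available in closed form, which makes the estimate (8) and the Hoeffding radius implied by (9) easy to sketch. The code below is an illustration under assumptions: the belief is a product of beta densities, the first reward is undiscounted so that the optimal value of a sampled bandit is $\max_a \mu_a / (1-\gamma)$, and the name `mc_upper_bound` and its signature are hypothetical.

```python
import math
import random


def mc_upper_bound(alpha, beta, gamma, c, delta, rng):
    """Monte Carlo estimate of the upper bound for a Bernoulli bandit belief.

    alpha[a], beta[a]: beta parameters of the belief over arm a (both > 0)
    c                : number of sampled bandits
    delta            : allowed failure probability of the error radius
    Returns (estimate, eps) with |estimate - true mean| <= eps w.p. >= 1 - delta.
    """
    samples = []
    for _ in range(c):
        mu = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
        # With the means known, the optimal policy always pulls the best arm.
        samples.append(max(mu) / (1.0 - gamma))
    estimate = sum(samples) / c
    v_range = 1.0 / (1.0 - gamma)  # V_max - V_min for rewards in [0, 1]
    # Solve 2 * exp(-2 c eps^2 / v_range^2) = delta for eps.
    eps = v_range * math.sqrt(math.log(2.0 / delta) / (2.0 * c))
    return estimate, eps
```

The radius shrinks as $1/\sqrt{c}$, so halving the error requires roughly four times as many sampled MDPs.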

2.5 Bounds on parent nodes 

We can obtain upper and lower bounds on the value of every action $a \in \mathcal{A}$, at
any part of the tree, by iterating over $\Omega_{t+1}$, the set of possible outcomes following
$\omega_t$:

$$\underline{v}(\omega_t, a) = \sum_{i=1}^{|\Omega_{t+1}|} P(\omega^i_{t+1} \mid \omega_t, a) \left[ r^i_{t+1} + \gamma\, \underline{v}(\omega^i_{t+1}) \right] \tag{10}$$

$$\hat v^*(\omega_t, a) = \sum_{i=1}^{|\Omega_{t+1}|} P(\omega^i_{t+1} \mid \omega_t, a) \left[ r^i_{t+1} + \gamma\, \hat v^*(\omega^i_{t+1}) \right], \tag{11}$$

where the probabilities are implicitly conditional on the beliefs at each $\omega_t$. For
every node, we can calculate an upper and lower bound on the value of all
actions. Obviously, if at the root node $\omega_t$ there exists some $\hat a^*_t$ such that
$\underline{v}(\omega_t, \hat a^*_t) \ge \hat v^*(\omega_t, a)$ for all $a$, then $\hat a^*_t$ is unambiguously the optimal action.

However, in general, there may be some other action, $a'$, whose upper bound
is higher than the lower bound of $\hat a^*_t$. In that case, we should expand either one
of the two trees.

It is easy to see that the upper and lower bounds at any node $\omega_t$ can be
expressed as a function of the respective bounds at the leaf nodes. Let $B(\omega_t, a)$
be the set of all branches from $\omega_t$ when action $a$ is taken. For each branch
$b \in B(\omega_t, a)$, let $\xi_{\omega_t}(b)$ be the probability of the branch from $\omega_t$, $u^b_t$ be the
discounted cumulative reward along the branch, and $t_b$ be its length. Finally, let $L(\omega_t, a)$ be the
set of leaf nodes reachable from $\omega_t$ and $\omega_b$ be the specific node reachable from
branch $b$. Then, upper or lower bounds on the value function can simply be
expressed as $v(\omega_t, a) = \sum_{b \in B(\omega_t, a)} \xi_{\omega_t}(b) \left[ u^b_t + \gamma^{t_b} v(\omega_b) \right]$. This would allow us to
use a heuristic for greedily minimising the uncertainty at any branch. However,
the algorithms we shall consider here will only employ evaluation of upper and
lower bounds.
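The one-step backups (10) and (11) and the test for an unambiguously optimal action can be sketched as follows. This is an illustrative fragment under assumptions: outcomes are given as explicit lists of (probability, reward, child) triples, the bound functions `lower` and `upper` are supplied by the caller, and the dominance test compares the best lower bound only against the upper bounds of the other actions. All names are hypothetical.

```python
def backup(children, leaf_value, gamma):
    """One-step backup of a bound through the outcomes of a single action.

    children  : list of (probability, reward, child_node) triples
    leaf_value: function mapping a child node to its (lower or upper) bound
    """
    return sum(p * (r + gamma * leaf_value(child)) for p, r, child in children)


def dominant_action(outcomes, lower, upper, gamma):
    """Return the action whose backed-up lower bound is at least the backed-up
    upper bound of every other action, or None if no action dominates yet.

    outcomes: dict mapping each action to its list of outcome triples
    """
    lo = {a: backup(ch, lower, gamma) for a, ch in outcomes.items()}
    hi = {a: backup(ch, upper, gamma) for a, ch in outcomes.items()}
    best = max(lo, key=lo.get)
    if all(lo[best] >= hi[a] for a in outcomes if a != best):
        return best
    return None  # bounds still overlap: further expansion is needed
```

When the function returns `None`, at least one competing action has an upper bound above the best lower bound, which is the case where one of the two subtrees should be expanded further.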



3 Algorithms 

At each time step $t$, $N$ expansions are performed, starting from state $\omega_0 =
\omega_t$. At the $n$-th expansion, a utility function $U$ is evaluated for every node
$\omega$ in the set of leaf nodes $L_n$. The main difference among the algorithms is the
way $U$ is calculated.



Algorithm 2 General tree expansion algorithm

procedure ExpandTree($\omega_t$, $N$)
    $n = 1$, $\omega_0 = \omega_t$
    $L_1 = \{\omega_0\}$
    while $n \le N$ do
        Take one more sample for $\hat v_c(\omega)$ from all leaf nodes $\omega \in L_n$
        Calculate $U(\omega_i)$ for all $i \in L_n$
        Calculate $C(\omega_b)$, the children of branch $b = \arg\max_i U(\omega_i)$
        $L_{n+1} = C(\omega_b) \cup (L_n \setminus \omega_b)$
        $n{+}{+}$
    end while
    return $a$
end procedure



1. Serial. This results in a nearly balanced tree, as the oldest leaf node is
expanded next, i.e. $U(\omega_i) = -i$, the negative node index.

2. Random. In this case, we expand any of the current leaf nodes with equal
probability, i.e. $U(\omega_i) = U(\omega_j)$ for all $i, j$. This can of course lead to
unbalanced trees.

3. Highest lower bound. We expand the node maximising a lower bound, i.e.
$U(\omega_i) = \gamma^{t_i}\, \underline{v}(\omega_i)$.

4. Thompson sampling. We expand the node for which the currently sampled
upper bound is highest, i.e. $U(\omega_i) = \gamma^{t_i}\, \tilde v^*(\omega_i)$.

5. High probability upper bound. We expand the node with the highest mean
upper bound, $U(\omega_i) = \gamma^{t_i} \max\{\hat v^*_{c(i)}(\omega_i), \underline{v}(\omega_i)\}$.

Here $t_i$ is taken to be the depth of node $\omega_i$ below the root, so that deeper
nodes are discounted. Methods 3 and 4 use only one sample from the upper bound
calculation at every iteration, whereas the last two methods retain the samples
obtained in the previous iterations and use them to calculate the mean estimate.
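The generic expansion loop with a pluggable utility can be sketched as below. This is a simplified illustration rather than the implementation used in the experiments: nodes are opaque objects, `children_fn` is a caller-supplied function enumerating the children of a node, and the random method is realised by drawing an independent uniform utility per leaf, which selects each leaf with equal probability. All function names are hypothetical.

```python
import random


def expand_tree(root, children_fn, utility, n_expansions):
    """Repeatedly replace the leaf of highest utility by its children.

    utility(index, node) -> float, where index is the creation order of the leaf.
    Returns the list of leaf nodes after n_expansions expansions.
    """
    leaves = [(0, root)]  # (creation index, node)
    next_index = 1
    for _ in range(n_expansions):
        best = max(leaves, key=lambda item: utility(item[0], item[1]))
        leaves.remove(best)
        for child in children_fn(best[1]):
            leaves.append((next_index, child))
            next_index += 1
    return [node for _, node in leaves]


def serial_utility(index, node):
    """Serial expansion: the oldest leaf (smallest index) is expanded first."""
    return -index


def make_random_utility(rng):
    """Random expansion: every leaf is equally likely to be chosen."""
    return lambda index, node: rng.random()
```

A bound-based method would plug in a utility such as `gamma ** depth(node) * lower(node)`, where `depth` and `lower` are again caller-supplied and hypothetical. With nodes represented by their depth and two children per node, three serial expansions yield four leaves, all at depth 2.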



4 Experiments 

We compared the regret of the tree expansion methods, measured relative to the
optimal policy, in bandit problems with Bernoulli rewards against two benchmarks:
the UCB1 algorithm [2], which suffers only logarithmic regret, and a Bayesian
algorithm that is greedy with respect to a mean Bayesian estimate with a prior
density Beta(0, 1), i.e. Algorithm 3 applied with $\omega = (0, \xi_t)$, which is a simple
optimistic heuristic for such problems.



Algorithm 3 Greedy Bayesian action selection

procedure BayesianGreedy($\omega$, $\gamma$)
    Select $a^* = \arg\max_a V^*_{\bar\mu_\omega}(s_\omega, a)$
end procedure



We compared the algorithms using the notion of expected undiscounted regret
accumulated over $T$ time steps, i.e. the expected loss that a specific policy
$\pi$ suffers over the policy which always chooses the arm $a^*$ with the highest mean
reward:

$$\sum_{t=1}^{T} \mathbb{E}[r \mid a_t = a^*] - \sum_{t=1}^{T} \mathbb{E}_\pi[r_t].$$

In order to determine the expected regret experimentally we must perform mul- 
tiple independent runs and average over them. 
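The averaging procedure can be illustrated with the UCB1 benchmark. The sketch below is not the experimental code behind the reported results: it assumes Bernoulli arms with known means `means` (used only to score the regret), accumulates the gap between the best mean and the mean of the chosen arm, and uses the standard UCB1 index of mean plus $\sqrt{2 \ln n / n_a}$. Function names and the seeding scheme are our own.

```python
import math
import random


def ucb1_regret(means, horizon, rng):
    """Cumulative undiscounted regret of one UCB1 run on a Bernoulli bandit."""
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialisation: play every arm once
        else:
            plays = t - 1  # total number of plays so far
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(plays) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # expected loss of this pull
    return regret


def average_regret(means, horizon, runs, seed=0):
    """Average the regret over independent runs to estimate its expectation."""
    rng = random.Random(seed)
    return sum(ucb1_regret(means, horizon, rng) for _ in range(runs)) / runs
```

Scoring each pull by the gap in means, rather than by the realised rewards, removes the reward noise from the estimate while leaving its expectation unchanged.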

Figure 1 shows the cumulative undiscounted regret for horizon $T = 2/(1-\gamma)$,
with $\gamma = 0.9999$ and $|\mathcal{A}| = 2$, averaged over 1000 runs. We compare the UCB1
algorithm (ucb), and the Bayesian baseline (base) with the BAMDP approach.
The figure shows the cumulative undiscounted regret as a function of the number
of look-aheads, for the following expansion algorithms: serial, random, highest
lower bound (lower bound), and high probability upper bound (upper bound).
The last two algorithms use $\gamma$-rate discounting for future node expansion.

It is evident from these results that the highest lower bound method never 
improves beyond the first expansion. This is due to the fact that the lower 
bounds never change after the first step when this algorithm is used. The simple 
serial expansion seems to perform only slightly better. On the other hand, while 
the serial expansion is consistently better than the random expansion, it does 
not manage to achieve less than half of the regret of the latter. It thus appears as 
though the stochastic selection of branches is in itself of quite some importance 
in this type of problem. For problems with more arms and longer horizons, the 
differences between methods are amplified. The results are in agreement with 
those obtained in the POMDP setting, where upper bound expansions appear 
to be best [18]. 

5 Conclusion 

One of this paper's aims was to draw attention to the interesting problem of 
tree expansion in Bayesian RL exploration. To this end, bounds on the optimal 
value function at belief tree leaf nodes have been derived and then utilised 
as heuristics for tree expansion. It is shown experimentally that the resulting 
expansion methods have very significant differences in computational complexity 
for bandit problems. While the results are preliminary in the sense that no 
experiments on more complex problems are presented and that only very simple 
expansion algorithms have been tried, they are nevertheless significant in the 
sense that the effect of the tree exploration method used is very large. 

Apart from performing further experiments, especially with more sophisticated
expansion algorithms, future work should include deriving bounds on the
minimum and maximum depth reached for each algorithm, as well as more general
regret bounds if possible. The regret could be measured either as simply
the $\epsilon, \delta$ optimality of $a^*$, or, more interestingly, bounds on the cumulative
online regret suffered by each algorithm. More importantly, problems with infinite
observation spaces (i.e. with continuous rewards) should also be examined.

Figure 1: Cumulative undiscounted regret accrued over $2/(1-\gamma)$ time-steps.

My current work includes the analysis of stochastic branch-and-bound
algorithms such as the ones described in [11,14]. Such an algorithm is essentially the
same as the high probability upper bound method used in the current paper.
Another interesting approach would be to develop a new expansion algorithm
that achieves a small anytime regret, perhaps along the lines of UCT [12]. Such
algorithms have been very successful in solving problems with large spaces and 
may be useful in this problem as well, especially when the space of observations 
becomes larger. 
A The Bayesian inference in detail

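Let $\mathcal{M}$ be the set of MDPs with unknown transition probabilities and state
space $\mathcal{S}$ of size $n$. We denote our belief at time $t+1$ about which MDP is true
as

$$\xi_{t+1}(\mu) \triangleq \xi_t(\mu \mid s_{t+1}, s_t, a_t) \tag{12a}$$

$$= \frac{\mu(s_{t+1} \mid s_t, a_t)\, \xi_t(\mu)}{\int_{\mathcal{M}} \mu'(s_{t+1} \mid s_t, a_t)\, \xi_t(\mu')\, d\mu'}. \tag{12b}$$

Since this is an infinite set of MDPs, we can have each MDP $\mu$ correspond to a
particular probability distribution over the state-action pairs. More specifically,
let us define for each state-action pair $s, a$, a Dirichlet distribution

$$\xi_t(q_{s,a} = x) = \frac{\Gamma\big(\sum_{i \in \mathcal{S}} \psi^{s,a}_i(t)\big)}{\prod_{i \in \mathcal{S}} \Gamma\big(\psi^{s,a}_i(t)\big)} \prod_{i \in \mathcal{S}} x_i^{\psi^{s,a}_i(t) - 1} \tag{13}$$

with $q_{s,a} \equiv P(s_{t+1} \mid s_t=s, a_t=a)$. We will denote by $\Psi(t)$ the matrix of
state-action-state transition counts at time $t$, with $\Psi(0)$ being the matrix defining our
prior Dirichlet distribution.

We shall now model the joint prior over transition distributions as simply the
product of priors. Then we can denote the matrix of state-action-state transition
probabilities for MDP $\mu$ as $Q^\mu$ and let $q^\mu_{s,a,i} = \mu(s_{t+1}=i \mid s_t=s, a_t=a)$. Then

$$\xi_t(\mu) = \xi_t(Q^\mu) = \xi_t\big(q_{s,a} = q^\mu_{s,a}\ \ \forall s \in \mathcal{S}, a \in \mathcal{A}\big) \tag{14a}$$

$$= \prod_{s \in \mathcal{S}} \prod_{a \in \mathcal{A}} \xi_t\big(q_{s,a} = q^\mu_{s,a}\big) \tag{14b}$$

$$= \prod_{s \in \mathcal{S}} \prod_{a \in \mathcal{A}} \frac{\Gamma\big(\sum_{i \in \mathcal{S}} \psi^{s,a}_i(t)\big)}{\prod_{i \in \mathcal{S}} \Gamma\big(\psi^{s,a}_i(t)\big)} \prod_{i \in \mathcal{S}} \big(q^\mu_{s,a,i}\big)^{\psi^{s,a}_i(t) - 1}, \tag{14c}$$

where we assume that each state-action pair's transition distribution is
independent of the other transition distributions. This means that $\Psi$ is a sufficient
statistic for expressing the density over $\mathcal{M}$.

We can additionally model $\mu(r_{t+1} \mid s_t, a_t)$ with a suitable belief and assume
independence. This in no way complicates the exposition for MDPs.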
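Because the count matrix is a sufficient statistic, the belief update after an observed transition amounts to incrementing one entry. The sketch below illustrates this with nested lists; the function names are hypothetical, and the update is written to return a new matrix so that a parent node in a belief tree keeps its own belief.

```python
def update_counts(psi, s, a, s_next):
    """Return a copy of the Dirichlet parameters with one transition added.

    psi[s][a][s2]: Dirichlet parameter for the transition (s, a) -> s2
    """
    new_psi = [[row[:] for row in psi_s] for psi_s in psi]
    new_psi[s][a][s_next] += 1.0
    return new_psi


def posterior_mean(psi, s, a):
    """Mean transition distribution from (s, a) under the current belief."""
    total = sum(psi[s][a])
    return [c / total for c in psi[s][a]]
```

Starting from uniform parameters `[1.0, 1.0]` for a single state-action pair, one observed transition to the second state gives the mean distribution `[1/3, 2/3]`, and each distinct observation produces a distinct child belief, which is why the belief transition is singular.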